We show how rate-distortion theory provides a mechanism for automated theory building by naturally
distinguishing between regularity and randomness. We start from the simple principle that
model variables should, as much as possible, render the future and past conditionally independent. 
From this, we construct an objective function for model making whose extrema embody the trade-off 
between a model's structural complexity and its predictive power. The solutions correspond to a 
hierarchy of models that, at each level of complexity, achieve optimal predictive power at minimal 
cost. In the limit of maximal prediction the resulting optimal model identifies a process's intrinsic 
organization by extracting the underlying causal states. In this limit, the model's complexity is given 
by the statistical complexity, which is known to be minimal for achieving maximum prediction. Ex- 
amples show how theory building can profit from analyzing a process's causal compressibility, which 
is reflected in the optimal models' rate-distortion curve — the process's characteristic for optimally 
balancing structure and noise at different levels of representation. 

PACS numbers: 02.50.-r, 89.70.+c, 05.45.Tp, 02.50.Ey



I. INTRODUCTION 

Progress in science is often driven by the discovery of 
novel patterns. Historically, physics has relied on the 
creative mind of the theorist to articulate mathematical 
models that capture nature's regularities in physical prin- 
ciples and laws. But the last decade has witnessed a new 
era in collecting truly vast data sets. Examples include 
contemporary experiments in particle physics [1] and astronomy [2],
but range to genomics, automated language
translation [3], and web social organization [4]. In all
these, the volume of data far exceeds what any human 
can analyze directly by hand. 

This presents a new challenge — automated pattern dis- 
covery and model building. A principled understanding 
of model making is critical to provide theoretical guid- 
ance for developing automated procedures. In this Let- 
ter, we show how basic information-theoretic optimality 
criteria provide a method for automatically constructing 
a hierarchy of models that achieve different degrees of 
abstraction. Importantly, we show that in appropriate 
limits the method recovers a process's causal organiza- 
tion. Without this connection, it would be only another 
approach to statistical inference, with its own ad hoc as- 
sumptions about the character of natural pattern. 

Our starting point is the observation that natural systems
store, process, and produce information: they compute
intrinsically [5, 6, 7]. Theory building, then, faces
the challenge of extracting from that information the
structures underlying its generation. Any physical theory
delineates mechanism from randomness by identifying
what part of an observed phenomenon is due to the
underlying process's structure and what is irrelevant. Ir- 
relevant parts are considered noise and typically modeled 
probabilistically. Successful theory building therefore de- 
pends centrally on deciding what is structure and what 
is noise, a distinction that is often made only implicitly.

What constitutes a good theory, though? Which in- 
formation is relevant? One can answer this question for 
time series prediction: Information about the future of 
the time series is relevant. Beyond forecasting, though, 
models are often put to the test by assessing how well 
they predict new data and, hence, it is of general im- 
portance that a model capture information which aids 
prediction. Typically, there are many models that ex- 
plain a given data set, and between two models that are 
equally predictive, one favors the simpler, smaller, less 
structurally complex model [8, 9]. However, a more com- 
plex model can achieve smaller prediction error than a 
less complex model. The trade-off between model com- 
plexity and prediction error is tantamount to finding a 
distinction between causal structure and noise. 

The trade-off between assigning a causal mechanism 
to the occurrence of an event or explaining the event as 
being merely random has a long history, but how one 
implements the trade-off is still a very active topic. Non- 
linear time series analysis [13, [HI EIj to take one ex- 
ample, attempts to account for long-range correlations 
produced by nonlinear dynamical systems — correlations 
not adequately modeled by assumptions such as linear- 
ity and independent, identically distributed (i.i.d.) data. 
Success in this endeavor requires directly addressing the 
notion of structure and pattern [l(J, HH . 

Examination of the essential goals of prediction led to a 
principled definition of structure that captures a dynam- 
ical system's causal organization in part by discovering 
the underlying causal states [H, IE 0]. In computational
mechanics a process $P(\overleftarrow{X}, \overrightarrow{X})$ is viewed as a communication
channel 0, EH : it transmits information from the
past $\overleftarrow{X} = \ldots X_{-3} X_{-2} X_{-1}$ to the future $\overrightarrow{X} = X_0 X_1 X_2 \ldots$
by storing it in the present. For the purpose of forecasting
the future, two different pasts, say $\overleftarrow{x}$ and $\overleftarrow{x}'$, are equivalent
if they result in the same prediction [f| . In general
this prediction is probabilistic, given by the conditional
future distribution $P(\overrightarrow{X} | \overleftarrow{x})$. The resulting equivalence
relation $\overleftarrow{x} \sim \overleftarrow{x}'$ groups all histories that give rise to the
same conditional future distribution:



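$$\epsilon(\overleftarrow{x}) = \{ \overleftarrow{x}' : \Pr(\overrightarrow{X} | \overleftarrow{x}) = \Pr(\overrightarrow{X} | \overleftarrow{x}') \} . \tag{1}$$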



The resulting partition of the space $\overleftarrow{\mathbf{X}}$ of pasts defines
the process's causal states $\boldsymbol{\mathcal{S}} = P(\overleftarrow{X}, \overrightarrow{X}) / \sim$.
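The equivalence relation of Eq. (1) can be illustrated with a minimal sketch, assuming the conditional future distributions are already known exactly. The function name `causal_state_partition`, the tolerance `tol`, and the four-past toy process are hypothetical and serve only as an illustration.

```python
def causal_state_partition(cond_future, tol=1e-9):
    """Group pasts whose conditional future distributions agree (Eq. 1).

    cond_future maps each past to a tuple of probabilities P(future | past)
    over a fixed, ordered set of futures. Returns a list of sets of pasts,
    one set per equivalence class.
    """
    states = []  # list of (representative distribution, set of pasts)
    for past, dist in cond_future.items():
        for rep, members in states:
            if all(abs(p - q) <= tol for p, q in zip(rep, dist)):
                members.add(past)
                break
        else:
            # No existing class predicts the same future: open a new one.
            states.append((dist, {past}))
    return [members for _, members in states]


# Hypothetical toy process: the prediction depends only on the last symbol.
cond_future = {
    "00": (0.5, 0.5),
    "10": (0.5, 0.5),
    "01": (0.9, 0.1),
    "11": (0.9, 0.1),
}
partition = causal_state_partition(cond_future)
```

In this toy case the four pasts collapse into two classes, one per distinct prediction. With estimated rather than exact distributions, the equality test would need a statistical criterion instead of a fixed tolerance.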

The causal states constitute a model that is maximally 
predictive by means of capturing all the information that 
the past of a time series contains about the future. As 
a result, knowing the causal state renders past and fu- 
ture conditionally independent, a property we call causal 
shielding, because the causal states have the Markovian 
property that they shield past and future [7]:



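$$P(\overleftarrow{X}, \overrightarrow{X} | \mathcal{S}) = P(\overleftarrow{X} | \mathcal{S}) \, P(\overrightarrow{X} | \mathcal{S}) , \tag{2}$$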



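where $\mathcal{S} \in \boldsymbol{\mathcal{S}}$. This is related to the fact that the causal-state
partition is optimally predictive. To see this, note
that Eq. (2) implies $P(\overrightarrow{X} | \overleftarrow{X}, \mathcal{S}) = P(\overrightarrow{X} | \mathcal{S})$. Furthermore,
note that, by definition, for any partition $\boldsymbol{\mathcal{R}}$ of $\overleftarrow{\mathbf{X}}$
with states $\mathcal{R}$, when the past is known, then the future
distribution is not altered by the history-space partitioning:

$$P(\overrightarrow{X} | \overleftarrow{X}, \mathcal{R}) = P(\overrightarrow{X} | \overleftarrow{X}) . \tag{3}$$

This implies for the causal states that $P(\overrightarrow{X} | \overleftarrow{X}, \mathcal{S}) = P(\overrightarrow{X} | \overleftarrow{X})$
and thus $P(\overrightarrow{X} | \mathcal{S}) = P(\overrightarrow{X} | \overleftarrow{X})$. Therefore,
causal shielding is equivalent to the fact [7] that the
causal states capture all of the information that
is shared between past and future: $I[\mathcal{S}; \overrightarrow{X}] = I[\overleftarrow{X}; \overrightarrow{X}]$,
the process's excess entropy $\mathbf{E}$ or predictive information
[15, 16, and references therein].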
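The statement that the causal states retain all of the excess entropy can be checked numerically on a small example. The sketch below assumes a hypothetical four-past, two-future process whose causal-state labels are known; `mutual_information` and all variable names are illustrative.

```python
import math


def mutual_information(joint):
    """I[A;B] in bits from a joint probability table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0)


# Hypothetical toy process: four equiprobable pasts, two possible futures.
p_past = [0.25, 0.25, 0.25, 0.25]
cond_future = [(0.5, 0.5), (0.5, 0.5), (0.9, 0.1), (0.9, 0.1)]
state_of_past = [0, 0, 1, 1]  # causal-state label of each past (Eq. 1)

# Joint distribution of (past, future).
joint_past_future = [[p_past[i] * q for q in cond_future[i]] for i in range(4)]

# Joint distribution of (causal state, future), obtained by merging pasts.
joint_state_future = [[0.0, 0.0], [0.0, 0.0]]
for i, s in enumerate(state_of_past):
    for j, q in enumerate(cond_future[i]):
        joint_state_future[s][j] += p_past[i] * q

excess_entropy = mutual_information(joint_past_future)
state_information = mutual_information(joint_state_future)
```

Both quantities coincide here because merging pasts with identical predictions discards no information about the future. Merging pasts with different predictions would make `state_information` strictly smaller.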

The causal states are unique and minimal sufficient 
statistics for time series prediction, capturing all of a 
process's predictive information at maximum efficiency.
The causal-state partition has the smallest statistical
complexity, $C_\mu := H[\mathcal{S}] \le H[\mathcal{R}]$, compared to all
other equally predictive partitions $\boldsymbol{\mathcal{R}}$. $C_\mu$ measures the
minimal amount of information that must be stored in 
order to communicate all of the excess entropy from the 
past to the future. Briefly stated, the causal states serve 
as the basis against which alternative models should be 
compared. 



II. CONSTRUCTING CAUSAL MODELS USING 
RATE-DISTORTION THEORY 

There are many scenarios in which one does not need to 
or explicitly does not want to capture all of the predictive 
information. How can we approximate the causal states 
in a controlled way? 

In this Letter, we show how to systematically construct 
smaller models, which are necessarily less predictive, but 
which are optimal in the sense that they capture, at a
fixed model complexity, the maximum possible amount 
of predictive information. Importantly, in the limit that 
removes the constraint on model complexity, our method 
retrieves the exact causal-state partition. 

Appealing to information theory again, we frame this 
in terms of communicating a model over a channel with 
limited capacity. Rate-distortion theory [17] provides a
principled way to find a lossy compression of an infor- 
mation source such that the resulting code is minimal at 
fixed fidelity to the original signal. 

The compressed representation, denote it $\mathcal{R}$, is in general
specified by a probabilistic map $P(\mathcal{R} | \overleftarrow{x})$ from the
input message, here the past $\overleftarrow{x}$, to code words, here the
model's states $\mathcal{R}$ with values $\rho \in \boldsymbol{\mathcal{R}}$. In contrast, Eq.
(1) specifies models that are described by a deterministic
map from histories to states: The causal states $\sigma \in \boldsymbol{\mathcal{S}}$
induce a deterministic partition of $\overleftarrow{\mathbf{X}}$, as one can show
that $P(\sigma | \overleftarrow{x}) = \delta_{\sigma, \epsilon(\overleftarrow{x})}$. The mapping $P(\mathcal{R} | \overleftarrow{x})$ specifies a
model, and the coding rate $I[\overleftarrow{X}; \mathcal{R}]$ measures its complexity,
which in turn is related to its statistical complexity
via $I[\overleftarrow{X}; \mathcal{R}] = H[\mathcal{R}] - H[\mathcal{R} | \overleftarrow{X}] = C_\mu(\mathcal{R}) - H[\mathcal{R} | \overleftarrow{X}]$. For
deterministic partitions the statistical complexity and
the coding rate are equal, because then $H[\mathcal{R} | \overleftarrow{X}] = 0$.
However, for more general, nondeterministic partitions,
$H[\mathcal{R} | \overleftarrow{X}] \neq 0$, meaning that the probabilistic nature of
the mapping curtails some of the model's complexity, and
the coding rate captures this.

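To illustrate this point, consider the extreme of uniform
assignments: $P(\mathcal{R} | \overleftarrow{x}) = 1/c$, for any given $\overleftarrow{x}$, where
$c = |\boldsymbol{\mathcal{R}}|$. In this case, even if there are many states, and hence a large
statistical complexity $H[\mathcal{R}] = \log_2 c$, they are indistinguishable:
$P(\overrightarrow{x} | \mathcal{R}) = \langle P(\overrightarrow{x} | \overleftarrow{x}) \rangle_{P(\overleftarrow{x})}$ for all $\mathcal{R}$, due to
the large uncertainty about the state, given the past.
This is reflected in $H[\mathcal{R} | \overleftarrow{X}] = \log_2 c$. In effect, the
model has only one state (the average $\langle P(\overrightarrow{x} | \overleftarrow{x}) \rangle_{P(\overleftarrow{x})}$)
and its statistical complexity vanishes, which is reflected
in the coding rate: $I[\overleftarrow{X}; \mathcal{R}] = 0$.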
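The two extremes discussed above (uniform versus deterministic assignments) can be compared directly. This is a minimal sketch under assumed inputs: four equiprobable pasts and four states; the helper names `entropy` and `coding_rate` are illustrative and not taken from the source.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def coding_rate(p_past, assign):
    """I[past; R] = H[R] - H[R | past] in bits.

    p_past[i] = P(past i); assign[i][k] = P(state k | past i).
    """
    n_states = len(assign[0])
    p_state = [sum(p_past[i] * assign[i][k] for i in range(len(p_past)))
               for k in range(n_states)]
    h_cond = sum(p_past[i] * entropy(assign[i]) for i in range(len(p_past)))
    return entropy(p_state) - h_cond


p_past = [0.25, 0.25, 0.25, 0.25]
c = 4

# Uniform assignment: every past is spread evenly over all c states.
uniform = [[1.0 / c] * c for _ in p_past]
# Deterministic assignment: each past gets its own state.
deterministic = [[1.0 if k == i else 0.0 for k in range(c)] for i in range(4)]

rate_uniform = coding_rate(p_past, uniform)
rate_deterministic = coding_rate(p_past, deterministic)
```

For the uniform map both H[R] and H[R | past] equal log2(c), so the rate vanishes; for the deterministic map the conditional entropy is zero and the rate equals H[R].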

Rate-distortion theory allows us to back away from the 
best (causal-state) representation toward less complex 
models by controlling the coding rate: Simpler models 
are distinguished from more complex ones by the fact 
that they can be transmitted more concisely. However, 
less complex models are also associated with a larger error.
Rate-distortion theory quantifies the loss by a distortion
function $d(\overleftarrow{x}; \rho)$. The coding rate is then minimized
[17] over the assignments $P(\mathcal{R} | \overleftarrow{X})$ at fixed average
distortion $D[\overleftarrow{X}; \mathcal{R}] = \langle d(\overleftarrow{x}; \rho) \rangle_{P(\overleftarrow{x}, \rho)}$.

In building predictive models, the loss should be mea- 
sured by how much the resulting models deviate from 
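accurate prediction. We take the shielding property, Eq.
(2), of the causal-state partition as the goal for any predictive
model. This condition is equivalent to the statement
that the excess entropy conditioned on the model
states $\mathcal{R}$:

$$I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{R}] = \left\langle \log \frac{P(\overleftarrow{x}, \overrightarrow{x} | \rho)}{P(\overleftarrow{x} | \rho) \, P(\overrightarrow{x} | \rho)} \right\rangle_{P(\overleftarrow{x}, \overrightarrow{x}, \rho)} \tag{4}$$

vanishes for the causal-state partition: $I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{S}] = 0$.
This gives us our distortion measure: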



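$$d(\overleftarrow{x}; \rho) = \left\langle \log \frac{P(\overleftarrow{x}, \overrightarrow{x} | \rho)}{P(\overleftarrow{x} | \rho) \, P(\overrightarrow{x} | \rho)} \right\rangle_{P(\overrightarrow{x} | \overleftarrow{x})} . \tag{5}$$

From Eq. (3) this is the same as the relative entropy between
the conditional future distributions given the past
and those given the model states $\rho$:

$$\mathcal{D}\big( P(\overrightarrow{x} | \overleftarrow{x}) \,\|\, P(\overrightarrow{x} | \rho) \big) = \left\langle \log \frac{P(\overrightarrow{x} | \overleftarrow{x})}{P(\overrightarrow{x} | \rho)} \right\rangle_{P(\overrightarrow{x} | \overleftarrow{x})} . \tag{6}$$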
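The step from the distortion measure to the relative entropy can be spelled out as follows. This is a short sketch of the intermediate algebra, assuming only the product rule for conditional probabilities and Eq. (3); it is not quoted from the original derivation.

```latex
\begin{align*}
\frac{P(\overleftarrow{x}, \overrightarrow{x} \mid \rho)}
     {P(\overleftarrow{x} \mid \rho)\, P(\overrightarrow{x} \mid \rho)}
  &= \frac{P(\overrightarrow{x} \mid \overleftarrow{x}, \rho)\, P(\overleftarrow{x} \mid \rho)}
          {P(\overleftarrow{x} \mid \rho)\, P(\overrightarrow{x} \mid \rho)}
     && \text{(product rule)} \\
  &= \frac{P(\overrightarrow{x} \mid \overleftarrow{x}, \rho)}
          {P(\overrightarrow{x} \mid \rho)} \\
  &= \frac{P(\overrightarrow{x} \mid \overleftarrow{x})}
          {P(\overrightarrow{x} \mid \rho)}
     && \text{(Eq. (3))}
\end{align*}
```

Averaging the logarithm of the last ratio over the conditional future distribution given the past yields the relative entropy between the two conditional future distributions.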



Altogether, we solve the constrained optimization prob- 
lem: 



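$$\min_{P(\mathcal{R} | \overleftarrow{X})} \left( I[\overleftarrow{X}; \mathcal{R}] + \beta \, I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{R}] \right) , \tag{7}$$

where the Lagrange multiplier $\beta$ controls the trade-off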
between model complexity and prediction error; i.e., the 
balance between structure and noise. 

The conditional excess entropy of Eq. (4) is the difference
between the process's excess entropy and the information
$I[\mathcal{R}; \overrightarrow{X}]$ that the model states contain about
the future: $I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{R}] = I[\overleftarrow{X}; \overrightarrow{X}] - I[\mathcal{R}; \overrightarrow{X}]$, due to Eq.
(3). The excess entropy $I[\overleftarrow{X}; \overrightarrow{X}]$ is a property intrinsic to
the process, however, and so not dependent on the model.
Therefore, the optimization problem in Eq. (7) is equivalent
to maximizing the information that the model states
carry about the future while minimizing information kept
about the past. This maps directly onto the information
bottleneck (IB) method [18]; here the future data is
IB's "relevant" quantity with respect to which the past
is summarized.
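The decomposition of the conditional excess entropy follows from the chain rule for mutual information. The lines below are a sketch of that reasoning, assuming only the chain rule and Eq. (3); the final line restates the objective of Eq. (7) in the resulting form.

```latex
\begin{align*}
I[\overleftarrow{X}, \mathcal{R}; \overrightarrow{X}]
  &= I[\mathcal{R}; \overrightarrow{X}]
     + I[\overleftarrow{X}; \overrightarrow{X} \mid \mathcal{R}]
   = I[\overleftarrow{X}; \overrightarrow{X}]
     + I[\mathcal{R}; \overrightarrow{X} \mid \overleftarrow{X}] , \\
I[\mathcal{R}; \overrightarrow{X} \mid \overleftarrow{X}] &= 0
  \quad \text{(by Eq. (3))} , \\
\Rightarrow \quad
I[\overleftarrow{X}; \overrightarrow{X} \mid \mathcal{R}]
  &= I[\overleftarrow{X}; \overrightarrow{X}] - I[\mathcal{R}; \overrightarrow{X}] , \\
I[\overleftarrow{X}; \mathcal{R}]
  + \beta\, I[\overleftarrow{X}; \overrightarrow{X} \mid \mathcal{R}]
  &= I[\overleftarrow{X}; \mathcal{R}]
     - \beta\, I[\mathcal{R}; \overrightarrow{X}]
     + \beta\, I[\overleftarrow{X}; \overrightarrow{X}] .
\end{align*}
```

Since the last term does not depend on the model, minimizing the objective amounts to trading the coding rate against the information the states carry about the future.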

In any case, the solution to the optimization principle 
is given by (cf. [3): 

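$$P_{\rm opt}(\rho | \overleftarrow{x}) = \frac{P(\rho)}{Z(\overleftarrow{x}, \beta)} \, e^{-\beta E(\rho, \overleftarrow{x})} , \tag{8}$$

where

$$E(\rho, \overleftarrow{x}) = \mathcal{D}\big( P(\overrightarrow{X} | \overleftarrow{x}) \,\|\, P(\overrightarrow{X} | \rho) \big) , \tag{9}$$

$$P(\overrightarrow{X} | \rho) = \frac{1}{P(\rho)} \sum_{\overleftarrow{x}} P(\overrightarrow{X} | \overleftarrow{x}) \, P(\rho | \overleftarrow{x}) \, P(\overleftarrow{x}) , \text{ and} \tag{10}$$

$$P(\rho) = \sum_{\overleftarrow{x}} P(\rho | \overleftarrow{x}) \, P(\overleftarrow{x}) . \tag{11}$$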



Eqs. (8)-(11) must be solved self-consistently, and this
can be done numerically [18].
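One common way to solve such self-consistent equations is fixed-point iteration at a fixed value of the Lagrange multiplier. The sketch below is an illustration under stated assumptions, not the authors' implementation: distributions are plain Python lists, logarithms are natural, all conditional future probabilities are strictly positive, and the names `kl` and `oci_fixed_beta` as well as the toy process are hypothetical.

```python
import math


def kl(p, q):
    """Relative entropy D(p || q) in nats; assumes q[j] > 0 wherever p[j] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def oci_fixed_beta(p_past, cond_future, assign, beta, n_iter=200):
    """Iterate Eqs. (8)-(11) at fixed beta.

    p_past[i]      = P(past i)
    cond_future[i] = list of P(future j | past i)
    assign[i][k]   = initial P(state k | past i)
    Returns the final assignments and the state-conditioned future
    distributions computed in the last pass.
    """
    n_past, n_state, n_fut = len(p_past), len(assign[0]), len(cond_future[0])
    state_future = None
    for _ in range(n_iter):
        # Eq. (11): marginal probability of each state.
        p_state = [sum(assign[i][k] * p_past[i] for i in range(n_past))
                   for k in range(n_state)]
        # Eq. (10): future distribution conditioned on each state.
        state_future = [
            [sum(cond_future[i][j] * assign[i][k] * p_past[i]
                 for i in range(n_past)) / p_state[k]
             for j in range(n_fut)]
            for k in range(n_state)]
        # Eqs. (8)-(9): Gibbs-form reassignment, energy = relative entropy.
        new_assign = []
        for i in range(n_past):
            w = [p_state[k] * math.exp(-beta * kl(cond_future[i], state_future[k]))
                 for k in range(n_state)]
            z = sum(w)  # partition function Z(past i, beta)
            new_assign.append([x / z for x in w])
        assign = new_assign
    return assign, state_future


# Hypothetical toy process with two distinct predictions.
p_past = [0.25, 0.25, 0.25, 0.25]
cond_future = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
init = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]

sharp, _ = oci_fixed_beta(p_past, cond_future, init, beta=50.0)
soft, _ = oci_fixed_beta(p_past, cond_future, init, beta=0.1)
```

In this toy case, a large multiplier drives the assignments toward a hard partition that separates the two predictions, whereas a small multiplier makes the assignments independent of the past, so the two states become effectively one.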

Eq. (8) specifies a family of models parametrized by $\beta$
with the form of Gibbs distributions. Within an analogy
to statistical mechanics [19], $\beta$ corresponds to the inverse
temperature, $E$ is the energy, and $Z = \langle e^{-\beta E(\rho, \overleftarrow{x})} \rangle_{P(\rho)}$
the partition function. Finally, note that for linear
Gaussian-distributed random variables the optimal linear
map can be computed analytically [20]. These results
can be carried over to the temporal setting that
concerns us here for linear Gaussian processes following
a rate-distortion approach similar to the above [21].



III. RETRIEVING THE CAUSAL-STATE 
PARTITION 

A key result is that these optimal solutions retrieve
the causal-state partition in the limit $\beta \to \infty$, which
emphasizes prediction accuracy [22, detailed proof]. To see
this, first note that as $\beta \to \infty$, the optimal assignment
becomes deterministic, $P_{\rm opt}(\rho | \overleftarrow{x}) \to \delta_{\rho, \rho^*(\overleftarrow{x})}$, where the
state $\rho^*(\overleftarrow{x})$ to which a past is assigned is the one minimizing
the energy, Eq. (9). Now, that function is zero when
the future probability conditioned on the state equals
the future probability conditioned on the past. This
means that, in the limit, all pasts with equal conditional
future probability distributions will be assigned to the
same state with $P(\overrightarrow{X} | \overleftarrow{x}) = P(\overrightarrow{X} | \rho^*(\overleftarrow{x}))$, for all those
pasts assigned to the state $\rho^*(\overleftarrow{x})$. This yields exactly
the causal-state partition given by the equivalence relation
that arises from Eq. (1).

Hence, one finds in this limit what we have argued is
the goal of predictive modeling. Moreover, what was otherwise
an ad hoc optimization method has been given a
structural grounding in that it captures a process's intrinsic
causal architecture. Recall that the model complexity
$C_\mu$ of the causal-state partition is minimal among
the optimal predictors and so not necessarily equal to the
maximum value of the coding rate $I[\overleftarrow{X}; \mathcal{R}] \le H[\overleftarrow{X}]$.



IV. FINDING APPROXIMATE CAUSAL 
REPRESENTATIONS: CAUSAL 
COMPRESSIBILITY 

While the causal-state partition captures all of the
predictive information, less complex models can be constructed
if one allows for larger distortion, accepting less
predictive power. For all models in the optimal family,
Eqs. (8)-(11), the original process is mapped to the best
causal-state approximation, at fixed model complexity.
And so we refer to the resulting method as optimal causal
inference (OCI). Several examples are studied in [22].

FIG. 1: Trading structure off against noise using optimal causal inference (OCI): Rate-distortion curve for the SNS process,
coding rate $I[\overleftarrow{X}; \mathcal{R}]$ versus distortion $I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{R}]$. Dashed lines mark maximum values: past entropy $H[\overleftarrow{X}]$ (horizontal) and
excess entropy (vertical). The causal-state limit for infinite sequences is shown in the upper left (solid box). (Inset)
SNS conditional future distributions $P(\overrightarrow{X} | \overleftarrow{x})$: OCI six-state reconstruction (six crosses), true causal states (six boxes), and
three-state approximation (three circles). Annealing rate was $\alpha = 1.1$.

The nature of the trade-off embodied in Eq. (7) can
be studied by evaluating the objective function at the
optimum for each value of $\beta$. The shape of the resulting
rate-distortion curve characterizes a process's causal
compressibility via the interdependence between $I[\overleftarrow{X}; \mathcal{R}]$
and $I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{R}]$. Since the variation of the objective function
in Eq. (7) vanishes at the optimum, the curve's
slope is $\delta I[\overleftarrow{X}; \mathcal{R}] / \delta D[\overleftarrow{X}; \mathcal{R}] = -\beta$. For a given process
the rate-distortion curve determines what predictability
the best model at a fixed complexity can achieve and,
vice versa, how small a model can be made at fixed
predictability. Below the curve lie infeasible causal compression
codes; above are feasible larger models that are
no more predictive than those directly on the curve. In
short, the rate-distortion curve determines how to optimally
trade structure for noise.

As an example, consider the simple nondeterministic 
source (SNS) — a hidden Markov process that specifies 
a binary information source with nontrivial statistical 
structure, including infinite-range correlations and an
infinite number of causal states [29].

The SNS's rate-distortion curve, calculated for pasts
of length 5 and futures of length 2, is shown in Fig. 1.
We computed the curve using a deterministic annealing
scheme following [19]. One starts at a high temperature
(low $\beta$) and slowly cools the system, waiting for it
to equilibrate by iterating the self-consistent Eqs. (8)-(11)
until convergence. At that point one continues by lowering
the temperature ($\beta \leftarrow \alpha \beta$) by a fixed annealing rate
$\alpha > 1$ and equilibrating again. During this procedure,
the number of effective states changes. Starting at high
temperatures, all pasts are assigned to states that are all
effectively the same state, as their predictions are equal.
States are allowed to split at each temperature. One observes
the proliferation of more and more states as the
temperature is lowered, until the causal states emerge in
the zero-temperature limit.
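The annealing procedure just described can be sketched as a loop that equilibrates, records the number of effective states, and then raises the inverse temperature. Everything here is an illustrative assumption rather than the scheme actually used for Fig. 1: the toy process, the fixed iteration count in place of a convergence test, the deterministic cosine nudge in `perturb` (a stand-in for random symmetry breaking), and the tolerance used to count effective states.

```python
import math


def kl(p, q):
    """Relative entropy D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def state_stats(p_past, cond_future, assign):
    """State marginals and state-conditioned future distributions."""
    n_state, n_fut = len(assign[0]), len(cond_future[0])
    p_state = [sum(a[k] * p for a, p in zip(assign, p_past))
               for k in range(n_state)]
    state_future = [[sum(c[j] * a[k] * p
                         for c, a, p in zip(cond_future, assign, p_past)) / p_state[k]
                     for j in range(n_fut)] for k in range(n_state)]
    return p_state, state_future


def equilibrate(p_past, cond_future, assign, beta, n_iter=50):
    """Fixed number of self-consistent updates at one temperature."""
    for _ in range(n_iter):
        p_state, state_future = state_stats(p_past, cond_future, assign)
        assign = []
        for c in cond_future:
            w = [ps * math.exp(-beta * kl(c, sf))
                 for ps, sf in zip(p_state, state_future)]
            z = sum(w)
            assign.append([x / z for x in w])
    return assign


def perturb(assign, eps=0.01):
    """Small deterministic nudge so that identical states are able to split."""
    out = []
    for i, row in enumerate(assign):
        w = [a * (1.0 + eps * math.cos(i + 1 + 2 * k)) for k, a in enumerate(row)]
        z = sum(w)
        out.append([x / z for x in w])
    return out


def effective_states(p_past, cond_future, assign, tol=1e-3):
    """Count states whose predictions differ by more than tol."""
    p_state, state_future = state_stats(p_past, cond_future, assign)
    distinct = []
    for ps, sf in zip(p_state, state_future):
        same = any(max(abs(x - y) for x, y in zip(sf, d)) < tol for d in distinct)
        if ps > tol and not same:
            distinct.append(sf)
    return len(distinct)


def anneal(p_past, cond_future, n_state, beta=1.0, rate=1.1, n_steps=50):
    """Cool the system: beta <- rate * beta after each equilibration."""
    assign = [[1.0 / n_state] * n_state for _ in p_past]
    history = []
    for _ in range(n_steps):
        assign = equilibrate(p_past, cond_future, perturb(assign), beta)
        history.append((beta, effective_states(p_past, cond_future, assign)))
        beta *= rate
    return assign, history


# Hypothetical toy process with two distinct predictions.
p_past = [0.25, 0.25, 0.25, 0.25]
cond_future = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
final_assign, history = anneal(p_past, cond_future, n_state=2)
```

In this toy case `history` starts with a single effective state at high temperature and ends with two at low temperature, mirroring the splitting behavior described in the text. A practical implementation would test for convergence at each temperature and allow more candidate states than are eventually needed.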

For the SNS the causal states for past and future 
strings of finite length are recovered by OCI (cross in 
upper left). For a comparison, there we also show the 
causal-state limit, which is calculated analytically for in- 
finite pasts and futures (solid box). 

The curve drops rapidly away from the finite causal- 
state model with six effective states, indicating that there 
is little predictive cost in using significantly smaller mod- 
els with successively fewer effective states. The curve 
then levels out below three states: smaller models incur 
a substantial increase in distortion (loss in predictability) 
while little is gained in terms of compression. Quantitatively,
specifying the best four-state model (at $I[\overleftarrow{X}; \mathcal{R}] = 1.92$ bits)
leads to 0.5% distortion, capturing 99.5% of the
SNS's excess entropy. The distortion increases to 2% for
three states (1.43 bits), 9% for two states (0.81 bits), and 
100% for a single state (0 bits). Overall, the three-state 
model lies near a knee in the rate-distortion curve and 
this suggests that it is a good compromise between model 
complexity and predictability. 

The inset in Fig. 1 shows the reconstructed conditional
future distributions for the optimal three-state and six-state
models in the simplex $P(\overrightarrow{X} | \overleftarrow{x})$. The six-state
model (crosses) reconstructs the true causal-state condi- 
tional future distributions (boxes), calculated from ana- 
lytically known finite-sequence causal states. The figure 
illustrates why the three-state model (circles) is a good 
compromise: two of the three-state model's conditional 
future distributions capture the two more-distinct SNS 
conditional future distributions, and its third one summarizes
the remaining, less different, SNS conditional future
distributions.

With its intricate causal structure and nontrivial 
causal compressibility properties the SNS process is typi- 
cal of stochastic processes. Other frequently studied pro- 
cesses are not, however. Two classes are of particular 
interest due to their widespread use. On one extreme 
of randomness are the i.i.d. processes alluded to in the 
introduction, such as the biased coin — by definition, a 
completely random and unstructured source. For all i.i.d. 
processes the rate-distortion curve collapses to a single 
point at (0,0), indicating that they are wholly unpre- 
dictable and causally incompressible. This is easily seen 
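by noting first that for i.i.d. processes the excess entropy
$I[\overleftarrow{X}; \overrightarrow{X}]$ vanishes, since $P(\overrightarrow{x} | \overleftarrow{x}) = P(\overrightarrow{x})$. Therefore,
$I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{R}] \le I[\overleftarrow{X}; \overrightarrow{X}]$ vanishes, too. Second, the energy
function $E(\rho, \overleftarrow{x})$ in the optimal assignments, Eq.
(8), vanishes, since $P(\overrightarrow{x} | \rho) = \langle P(\overrightarrow{x} | \overleftarrow{x}) \rangle_{P(\overleftarrow{x} | \rho)} = P(\overrightarrow{x})$.
The optimal assignment given by Eq. (8) is therefore the
uniform distribution and $I[\overleftarrow{X}; \mathcal{R}] \big|_{P_{\rm opt}} = 0$. (See Fig. 2.)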

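At the other extreme are the predictively reversible processes
for which

$$P(\overrightarrow{x} | \overleftarrow{x}) = \delta_{\overrightarrow{x}, f(\overleftarrow{x})} , \tag{12}$$

where $f$ is invertible, such as periodic processes. These
processes have a rate-distortion curve that is a straight
line, the negative diagonal. Note that $P(\overrightarrow{x} | \rho) = P(f^{-1}(\overrightarrow{x}) | \rho) = P(\overleftarrow{x} | \rho)$
and, therefore, $I[\overleftarrow{X}; \overrightarrow{X} | \mathcal{R}] = I[\overleftarrow{X}; \overrightarrow{X}] - I[\overleftarrow{X}; \mathcal{R}]$.
The variational principle now reads
$\delta \, (1 - \beta) \, I[\overleftarrow{X}; \mathcal{R}] = 0$, which implies that $\beta = 1$. For
these processes, the rate-distortion curve is the diagonal
that runs from $[0, C_\mu]$ (causal-state limit) to $[\mathbf{E}, 0]$, where
$\mathbf{E} = I[\overleftarrow{X}; \overrightarrow{X}]$ is the excess entropy, due to Eq. (12) and
invertibility. (See Fig. 2.)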
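The straight-line form of the curve can be made explicit by substituting the reversible-process identity into the objective of Eq. (7). The following is a sketch of that substitution; writing the distortion as $D$ and the excess entropy as $\mathbf{E}$ is notation assumed here for brevity.

```latex
\begin{align*}
I[\overleftarrow{X}; \mathcal{R}]
  + \beta\, I[\overleftarrow{X}; \overrightarrow{X} \mid \mathcal{R}]
  &= I[\overleftarrow{X}; \mathcal{R}]
     + \beta \left( \mathbf{E} - I[\overleftarrow{X}; \mathcal{R}] \right)
   = (1 - \beta)\, I[\overleftarrow{X}; \mathcal{R}] + \beta\, \mathbf{E} , \\
D &= I[\overleftarrow{X}; \overrightarrow{X} \mid \mathcal{R}]
   = \mathbf{E} - I[\overleftarrow{X}; \mathcal{R}]
  \quad \Longrightarrow \quad
  I[\overleftarrow{X}; \mathcal{R}] = \mathbf{E} - D .
\end{align*}
```

Every bit of coding rate therefore removes exactly one bit of distortion, which gives a line of slope $-1$ between zero distortion and the point where the distortion equals $\mathbf{E}$.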
FIG. 2: Schematic illustration of the causal incompressibility
of independent, identically distributed processes (square)
and predictively reversible processes (straight line connecting
circles).

This diagonal rate-distortion curve represents the 
worst possible case for causal compression. At each level,
specifying the future to one bit higher accuracy costs us
exactly one bit in model complexity. Processes in this 
class are thus not causally compressible. To be causally 
compressible, a process's rate-distortion curve must lie 
below the diagonal. The more concave the curve, the 
more causally compressible is the process. An extremely 
causally compressible process can be predicted to high 
accuracy with a model that can be encoded at a very low 
model cost. These are the processes that lie between the 
extremes of exact predictability and structureless ran- 
domness. 

These examples show how studying the hierarchy of 
optimal models, and the associated rate-distortion curve, 
allows one to learn about the causal compressibility of 
the process at hand, which serves to guide where the 
demarcation between structure and noise should lie. 



V. FINITE-SAMPLE FLUCTUATIONS 

As in statistical mechanics, we assumed so far that 

the distribution $P(\overleftarrow{X}, \overrightarrow{X})$ is given. And so, the above
results bear on an intrinsic distinction between structure 
and noise for a process, unsullied by statistical sample 
fluctuations. 

However, when one builds a model from finite samples, 
the distributions must be estimated from the available 
data and so sample fluctuations must be taken into ac- 
count. Intuitively, limited data size sets a bound on how 
much we can consider to be structure without overfitting. 
It turns out that, using [23], the effects of finite data can
be corrected, as we show in [22]. This connects the approach
taken here to statistical inference and machine
learning, where model complexity control is designed to 
avoid overfitting due to finite-sample fluctuations; cf., 

e.g, 0, HE IE IE m. 

VI. CONCLUSION 

We showed how rate-distortion theory can be employed 
to find optimal causal models at varying degrees of ab- 
straction. Starting with the simple modeling principle of 
causal shielding, an objective function was constructed 
that embodied the trade-off between model complexity 
and predictability. Since the variational principle corresponded
to one from rate-distortion theory, known analysis methods
could be employed. Solutions to the objective function
were found using an iterative algorithm, and the
rate-distortion curve was computed using deterministic 
annealing. 

For certain processes we calculated the curve analyti- 
cally. These and a numerical example served to demon- 
strate how its shape reveals a process's causal compress- 
ibility, providing direct guidance for automated model 
making. In particular, we showed how a model distin- 
guishes between what it effectively considers to be under- 
lying structure and what is noise. Practically speaking, 
natural processes that have high causal compressibility
will admit particularly parsimonious theories that cap- 
ture a large fraction of observed behavior. 

We pointed out that OCI finds the causal-state par- 
tition exactly when the constraint on model complexity 
is relaxed. Then we showed how to automatically build 
models with varying degrees of abstraction. By focusing 
on the case in which limitations due to finite sampling 
errors are absent, we emphasized that compact represen- 
tations, in and of themselves, are critical aids to scientific 
understanding. We pointed out, however, that finite data
set size imposes a maximum level of allowable accuracy
before overfitting occurs and that previous results can be 
used to find that demarcation line as well. 